Work as part of the Collaborative Research Centre 990: Ecological and Socioeconomic Functions of Tropical Lowland Rainforest Transformation Systems (Sumatra, Indonesia)
Many studies have attempted to produce land-use/land-cover maps for Jambi province. Nevertheless, these maps do not meet the project's demands, for different reasons:
e.g. (Ekadinata and Vincent 2011; Melati 2017; Nurwanda, Zain, and Rustiadi 2016; Soo Chin Liew, Chew Wai Chang, and Kim Hwa Lim 2003)
In order to serve as a reference for further research questions in the project, a number of demands have been defined. The produced map should:
Ideally, a similar map with the same classes can be produced for past years to:
The classification process is done with Google Earth Engine (GEE) and R. While the reference data was delineated in GEE, images from Bing Maps and Google Earth Pro were also used for visual interpretation of the lulc classes. As model tuning is not possible in GEE, the reference data was exported to R in order to find the best model parameters. The classification itself was then done in GEE, using the model parameters defined in R.
| Reference Data | Model Tuning | Classification |
|---|---|---|
Previous studies on land cover and land use in Jambi used different classification keys. Some classes were split into sub-categories in certain studies, while other classes were merged or not assessed at all.
Melati (2017) used different classification schemes for different purposes (areas) by merging the 22 classes differentiated by the Ministry of Forestry. Stolle et al. (2003) also used many classes, while Sambodo and Indriasari (2018), Nurwanda, Zain, and Rustiadi (2016), and Ekadinata and Vincent (2011) merged some of the classes into broader categories.
classes <- read_csv(here("raw_data/classification-classes.csv"), na = "NA")
dt_classes <- classes
names(dt_classes)[1] <- paste0(names(dt_classes)[1], footnote_marker_symbol(1))
dt_classes %>%
  kable(escape = F) %>%
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), font_size = 12, fixed_thead = T) %>%
  column_spec(1:9, width_min = "15em", italic = TRUE) %>%
  collapse_rows(columns = 1:9, valign = "top") %>%
  scroll_box(width = "100%") %>%
  footnote(symbol = "as used in Melati (2017)")
| Ministry of Forestry* | Melati (2017) I | Melati (2017) II | Melati (2017) III | Stolle et al. (2003) | Sambodo et al. (2018) | Nurwanda et al. (2016) | Ekadinata et al. (2011) | this study |
|---|---|---|---|---|---|---|---|---|
| Primary dryland forest | NA | NA | Primary forest | Natural vegetation | Forest | Forest | Forest | Primary forest |
| Primary swamp forest | NA | NA | ||||||
| Primary mangrove forest | NA | NA | ||||||
| Secondary dryland forest | Secondary dryland forest | Secondary forest | Secondary forest | Logged-over forest | Secondary forest | |||
| Secondary swamp forest | Secondary swamp forest | |||||||
| Secondary mangrove forest | Secondary mangrove forest | Mangrove + Shrubs with trees | ||||||
| Jungle rubber | Jungle rubber | Jungle rubber | Jungle rubber | Small-holder rubber | Rubber | Rubber | Rubber agroforest | |
| Rubber Plantation | Rubber plantation | Rubber plantation | Rubber plantation | Rubber plantation | Monoculture rubber | Plantation forest | ||
| Plantation forest | Plantation forest | ? | Plantation forest | Tree crop mosaic | ? | ? | ? | |
| Oil palm plantation | Oil palm plantation | Oil palm plantation | Oil palm plantation | Oil palm plantation | Oil palm + Coconut plantations | Oil palm plantation | Oil palm plantation | Immature oil palm plantation |
| Mature oil palm plantation | ||||||||
| Oil/rubber plantation | ||||||||
| NA | NA | NA | NA | Coconut plantation | NA | NA | Coconut plantation | |
| Agriculture | Mixed dryland agriculture | NA | Agriculture | Tea plantation | Cropland | Mixed dryland agriculture | NA | Tea plantation |
| NA | Food crop mosaic | NA | NA | |||||
| Dryland agriculture | NA | Secondary/food mosaic | NA | NA | ||||
| Paddy field | NA | Sawah or tidal rice | NA | Rice field | Rice | |||
| Shrub/bush | Shrub/bush | Shrub/bush | Shrub/bush | Others | Shrub | Shrub | Shrub | Shrub/bush |
| Swamp bush | Swamp bush | |||||||
| Bare land | Bare land | Bare land | Others | Bare soil | Bare land | Cleared land | Burned/Cleared | |
| Mining | Mining | |||||||
| Water bodies | Water body | Water body | Water | Water body | NA | Water | ||
| Fishponds | NA | |||||||
| Transmigration areas | Settlement | Settlement | Settlement | Built-up area | Settlement | Urban | ||
| Airport | ||||||||
| Settlements | Homesteads |
As Melati (2017) suggests, oil palm plantations were differentiated into mature and immature plantations. A separate rubber class was not included, since other studies had no success with this (e.g. Melati (2017)) and because rubber trees could not be differentiated visually in high-resolution imagery during reference data creation.
Reference data was delineated in GEE with the support of high-resolution imagery from Google Earth Pro and Bing Maps. The following conditions were met for data collection:
sample of reference data collected in GEE
A supervised classification algorithm with good performance should be used in combination with informative predictor variables.
input variables
A combination of Sentinel-1 (radar) and Sentinel-2 (optical) imagery is used
classification model
Decision tree models are widely used for classification problems. Preliminary tests indicated that the Random Forest (RF) algorithm performed better than Classification And Regression Trees (CART). Hence, an RF model was trained to classify land use and land cover.
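Such a preliminary comparison can be sketched with caret. The snippet below is illustrative only: it uses the built-in iris data as a stand-in for the project's reference data, so the resulting accuracies say nothing about the lulc classes.

```r
# Illustrative sketch only: compares CART and a Random Forest with caret
# on the built-in iris data, not on the project's reference data.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)

# single decision tree (CART)
cart_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Random Forest via ranger, as used later in this study
rf_fit_demo <- train(Species ~ ., data = iris, method = "ranger", trControl = ctrl)

# compare the best cross-validated accuracy of each model
max(cart_fit$results$Accuracy)
max(rf_fit_demo$results$Accuracy)
```

The rpart and ranger packages are assumed to be installed; caret dispatches to them via the `method` argument.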
classification key
library(readr)
library(tidyverse)
library(caret)
library(kableExtra)
library(here)
# read reference data exported by Google Earth Engine
ref <- read_csv(here("raw_data/reference/stratified-reference_20190521.csv"),
                col_types = cols(.geo = col_skip(),
                                 class = col_factor(levels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12")),
                                 latitude_209564535 = col_skip(),
                                 longitude_209564535 = col_skip(),
                                 `system:index` = col_skip()))
# split in train and test data
index <- createDataPartition(y = ref$class, p = .7, list = FALSE)
training <- ref[index, ]
testing <- ref[-index, ]
# create random forest model
filename = here("output_data/model/rf_fit.rds")
if (file.exists(filename)){
rf_fit <- readRDS(filename)
} else{
# specify the resampling method: 10-fold cross-validation
fit_control <- trainControl(method = "cv",
                            number = 10)
# define a grid of parameter options to try
rf_grid <- expand.grid(mtry = c(2, 3, 4, 6, 8),
splitrule = c("gini"),
min.node.size = c(1, 5, 10))
# run a random forest model
set.seed(825)
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
# fit the model with the parameter grid
rf_fit <- train(class ~ .,
data = training,
method = "ranger",
importance = "impurity" ,
trControl = fit_control,
# provide a grid of parameters
tuneGrid = rf_grid)
stopCluster(cl)
# save model
saveRDS(rf_fit, filename)
}
print(rf_fit)
## Random Forest
##
## 8400 samples
## 19 predictor
## 12 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '10', '11', '12'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 7560, 7560, 7560, 7560, 7560, 7560, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2 1 0.9273810 0.9207792
## 2 5 0.9261905 0.9194805
## 2 10 0.9238095 0.9168831
## 3 1 0.9285714 0.9220779
## 3 5 0.9273810 0.9207792
## 3 10 0.9257143 0.9189610
## 4 1 0.9275000 0.9209091
## 4 5 0.9261905 0.9194805
## 4 10 0.9248810 0.9180519
## 6 1 0.9246429 0.9177922
## 6 5 0.9267857 0.9201299
## 6 10 0.9241667 0.9172727
## 8 1 0.9241667 0.9172727
## 8 5 0.9261905 0.9194805
## 8 10 0.9235714 0.9166234
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = gini
## and min.node.size = 1.
# ggplot(data= rf_fit$results) +
# geom_line(aes(x=mtry, y=Accuracy, color= as.factor(min.node.size))) +
# geom_point(aes(x=mtry, y=Accuracy, color= as.factor(min.node.size)), size = 3) +
# scale_color_viridis(discrete = TRUE, option = "D", name="Minimal Node Size") +
# labs(x = "Randomly Selected Predictors",
# y = "Accuracy (Cross-Validation)") +
# theme_classic() +
# theme(legend.position="bottom")
library(plotly)
library(scales)  # for viridis_pal()
rf_fit$results %>%
  plot_ly(x = ~mtry) %>%
  add_trace(y = ~Accuracy,
            mode = 'lines+markers',
            color = ~as.factor(min.node.size), colors = viridis_pal(option = "D")(3),
            error_y = ~list(array = AccuracySD),
            text = ~paste("Accuracy: ", Accuracy, "<br>AccuracySD:", AccuracySD)) %>%
  layout(xaxis = list(title = "Randomly Selected Predictors"),
         yaxis = list(title = "Accuracy (Cross-Validation)"))
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
Gridsearch result for number of predictor variables and minimal node size
Differences between the parameter combinations are not very pronounced.
simplest model with high accuracy (max. 2% difference to the best model)
# get simplest model with similar accuracy
whichTwoPct <- tolerance(rf_fit$results, metric = "Accuracy",
tol = 2, maximize = TRUE)
rf_fit$results[whichTwoPct,]
## mtry splitrule min.node.size Accuracy Kappa AccuracySD KappaSD
## 1 2 gini 1 0.927381 0.9207792 0.008566283 0.009345036
During model fitting, a relative variable importance is calculated to give an impression of which predictor variables are valuable for the prediction process and which are less so. Note, however, that correlation between variables is not taken into account.
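The effect of such correlation can be demonstrated on synthetic data: in the sketch below, `Petal.Width2` is a hypothetical near-duplicate of `Petal.Width` added to the built-in iris data, and the impurity importance that one variable would otherwise accumulate is split between the two correlated columns.

```r
# Illustrative sketch: impurity importance is shared between correlated
# predictors. Petal.Width2 is a hypothetical near-copy of Petal.Width.
library(ranger)

set.seed(42)
dat <- iris
dat$Petal.Width2 <- dat$Petal.Width + rnorm(nrow(dat), sd = 0.01)

fit <- ranger(Species ~ ., data = dat, importance = "impurity")

# the two correlated variables split the importance that Petal.Width
# would otherwise receive alone
sort(fit$variable.importance, decreasing = TRUE)
```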
# variable importance
imp <- varImp(rf_fit)
plot(imp)
Relative variable importance
The validation data is classified with the best model obtained by grid search. Its confusion matrix is calculated as a measure of prediction accuracy.
Error Matrix
# validation
testing.pred <- predict(rf_fit, newdata = testing[1:19])
cm <- confusionMatrix(data = testing.pred, reference = testing$class)
cm.df <- as.data.frame.matrix(cm$table)
cm.df %>%
mutate_all(~ifelse(. > max(cm.df[1:12])*0.75,
cell_spec(., "html", color = "white", bold = T, background = "green"),
ifelse(. > 0,
cell_spec(., color = "white", bold = T, background = spec_color(., option = "A", direction= -1, begin = 0.3, scale_from = c(0,53))),
cell_spec(., "html", color = "grey")
)
)
) %>%
mutate(class = cell_spec(c("water - 0","primary forest - 1","secondary forest - 2","mature oilpalm - 3", "immature oilpalm - 4","shrubland - 5","plantation forest - 6","burned / cleared - 7","urban buildings - 8","coconut plantation - 10","rice - 11","tea plantation - 12"), "html",bold = T)) %>%
select(class, everything(.)) %>%
kable("html", escape = F, rownames = T, align=c('r', rep('c', 12))) %>%
kable_styling("hover", full_width = F)
| class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| water - 0 | 300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| primary forest - 1 | 0 | 282 | 9 | 1 | 1 | 0 | 3 | 0 | 0 | 4 | 0 | 0 |
| secondary forest - 2 | 0 | 8 | 284 | 0 | 0 | 0 | 2 | 0 | 0 | 3 | 0 | 0 |
| mature oilpalm - 3 | 0 | 1 | 1 | 293 | 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| immature oilpalm - 4 | 0 | 0 | 0 | 2 | 293 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| shrubland - 5 | 0 | 0 | 3 | 2 | 0 | 300 | 0 | 0 | 0 | 1 | 0 | 0 |
| plantation forest - 6 | 0 | 0 | 0 | 0 | 0 | 0 | 294 | 0 | 0 | 0 | 0 | 0 |
| burned / cleared - 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 298 | 2 | 0 | 0 | 0 |
| urban buildings - 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 297 | 0 | 0 | 0 |
| coconut plantation - 10 | 0 | 9 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 289 | 0 | 0 |
| rice - 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 300 | 0 |
| tea plantation - 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 300 |
Respective Accuracy
print(cm$overall[1:4])
## Accuracy Kappa AccuracyLower AccuracyUpper
## 0.9805556 0.9787879 0.9754962 0.9848115
To further simplify the prediction model, Recursive Feature Elimination (RFE) is applied. At each step this eliminates the worst-performing predictor variables (chosen by importance) and keeps the best-performing ones.
# perform reverse feature selection with all variables
filename = here("output_data/model/rfProfile_all.rds")
if (file.exists(filename)){
rfProfile <- readRDS(filename)
} else{
# normalize data
training_recipe <- recipe(class ~ ., data = training) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_nzv(all_predictors())
train_prepped <-
training_recipe %>%
prep(training) %>%
juice()
# number of features to test
subsets <- c(1:19)
training_ctrl <- rfeControl(
method = "repeatedcv",
repeats = 5,
functions = rfFuncs,
returnResamp = "all"
)
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
rfProfile <- rfe(x = train_prepped %>% dplyr::select(-class),
y = train_prepped$class,
sizes = subsets,
metric = "Accuracy",
rfeControl = training_ctrl)
stopCluster(cl)
# save model
saveRDS(rfProfile, filename)
}
Model performance by number of features, evaluated with Recursive Feature Elimination
Chosen variables
## [1] "VH" "B5" "VV_VH" "B11" "VV"
## [6] "B12" "NBRI" "NDVI" "B4" "NDMI"
## [11] "B3" "B6" "SAVI" "NDWI" "B7"
## [16] "VH_variance" "B8A" "B8"
To simplify the model without losing prediction accuracy, we search for a model with fewer predictor variables that has nearly the same accuracy (at most a 2% difference).
## Variables Accuracy Kappa AccuracySD KappaSD
## 8 8 0.9098095 0.9016104 0.00739575 0.008068091
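The document does not show the code for this selection; one way to pick such a row is caret's `tolerance()` helper, shown here with made-up accuracy values (not the project's results) so the mechanics are visible.

```r
# Toy illustration of caret::tolerance() with made-up accuracies: it picks
# the simplest model whose accuracy is within `tol` percent of the best.
library(caret)

res <- data.frame(Variables = c(2, 4, 8, 18),
                  Accuracy  = c(0.85, 0.88, 0.91, 0.92))

# rows must be ordered from simplest to most complex model
best_row <- tolerance(res, metric = "Accuracy", tol = 2, maximize = TRUE)
res[best_row, ]  # the 8-variable row: within 2% of the best accuracy
```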
A model using the following 8 predictor variables instead of all 18 has almost the same accuracy.
selectedVars <- rfProfile$variables
bestVar <- rfProfile$control$functions$selectVar(selectedVars, var_num)
bestVar
## [1] "VH" "B5" "VV_VH" "B11" "VV" "B12" "NBRI" "NDVI"
Using the results from previous analysis we train a model with best performing predictor variables and model-hyperparameters.
The predictor variables are: VH, B5, VV_VH, B11, VV, B12, NBRI, NDVI.
The hyperparameters are: num.trees = 500, mtry = 2, min.node.size = 1.
Applying the final model to the validation data set, the error matrix looks like this.
Error Matrix
library(ranger)
f <- as.formula(paste("class", paste(bestVar, collapse=" + "), sep=" ~ "))
simpler_model <- ranger(formula = f,
data = training,
num.trees = 500,
mtry = 2,
min.node.size = 1,
importance = "impurity")
testing.pred <- predict(simpler_model, testing)
cm <- confusionMatrix(data = testing.pred$predictions, reference = testing$class)
cm.df <- as.data.frame.matrix(cm$table)
cm.df %>%
mutate_all(~ifelse(. > max(cm.df[1:12])*0.5,
cell_spec(., "html", color = "white", bold = T, background = "green"),
ifelse(. > 0,
cell_spec(., color = "white", bold = T, background = spec_color(., option = "A", direction= -1, begin = 0.3, scale_from = c(0,53))),
cell_spec(., "html", color = "grey")
)
)
) %>%
mutate(class = cell_spec(c("water - 0","primary forest - 1","secondary forest - 2","mature oilpalm - 3", "immature oilpalm - 4","shrubland - 5","plantation forest - 6","burned / cleared - 7","urban buildings - 8","coconut plantation - 10","rice - 11","tea plantation - 12"), "html",bold = T)) %>%
select(class, everything(.)) %>%
kable("html", escape = F, rownames = T, align=c('r', rep('c', 12))) %>%
kable_styling("hover", full_width = F)
| class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| water - 0 | 299 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| primary forest - 1 | 0 | 221 | 38 | 2 | 0 | 0 | 4 | 0 | 0 | 13 | 0 | 2 |
| secondary forest - 2 | 0 | 49 | 229 | 0 | 1 | 0 | 3 | 0 | 0 | 24 | 0 | 0 |
| mature oilpalm - 3 | 0 | 3 | 4 | 278 | 20 | 1 | 0 | 0 | 0 | 3 | 0 | 1 |
| immature oilpalm - 4 | 0 | 1 | 1 | 14 | 273 | 1 | 0 | 1 | 1 | 1 | 1 | 2 |
| shrubland - 5 | 0 | 0 | 5 | 2 | 0 | 284 | 1 | 0 | 0 | 2 | 0 | 3 |
| plantation forest - 6 | 0 | 0 | 1 | 0 | 0 | 0 | 290 | 0 | 0 | 1 | 0 | 0 |
| burned / cleared - 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 291 | 4 | 0 | 4 | 0 |
| urban buildings - 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 293 | 0 | 1 | 0 |
| coconut plantation - 10 | 0 | 26 | 22 | 3 | 6 | 2 | 1 | 1 | 0 | 255 | 0 | 0 |
| rice - 11 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 7 | 2 | 1 | 293 | 0 |
| tea plantation - 12 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 1 | 292 |
Respective Accuracy
print(cm$overall[1:4])
## Accuracy Kappa AccuracyLower AccuracyUpper
## 0.9161111 0.9084848 0.9065748 0.9249651
The model validation provides values for overall model performance, but no regional performance values. Therefore, a second classification was produced to inform about per-pixel classification certainty. Since classification probability can only be calculated for a two-class classifier, and the project focus is on oil palm plantations, the lulc classes in the training dataset were merged into the following classes:
A classifier was then trained on this data with its output mode set to probability.
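A minimal sketch of such a two-class probability forest follows, using synthetic stand-in data: the variable names (`VH`, `VV`, `NDVI`) and the oil palm class codes ("3" mature, "4" immature) follow this study, but the values are random, so the predictions are illustrative only.

```r
# Sketch with synthetic data standing in for the project's training set:
# the oil palm classes ("3" and "4") are merged into one class against
# all others, then a probability forest is trained.
library(ranger)

set.seed(42)
fake <- data.frame(
  VH = rnorm(300), VV = rnorm(300), NDVI = runif(300),
  class = factor(sample(c("0", "1", "3", "4", "5"), 300, replace = TRUE))
)

# binary target: oil palm vs everything else
fake$oilpalm <- factor(ifelse(fake$class %in% c("3", "4"), "oilpalm", "other"))
fake$class <- NULL

# probability = TRUE returns per-pixel class probabilities instead of labels
prob_fit <- ranger(oilpalm ~ ., data = fake, probability = TRUE)
head(predict(prob_fit, fake)$predictions)
```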
see in GEE
For many areas in Jambi there is no appropriate reference image available. Problems include, among others:
- no high-resolution imagery is available
- high-resolution imagery is covered by clouds
- the last appropriate image is several years old
In many regions of Jambi province the lulc pattern is very heterogeneous, with many different classes densely mixed. In these cases it is hard to collect reference data, because the areas covered by a single class are very small and the chance of receiving backscatter from mixed classes is high.
In some cases lulc classes are divided by other classes (e.g. intercropped with trees), which also makes it more difficult to collect proper reference data.
Some classes are hard to differentiate from each other in aerial images. For example:
- Primary forest - Secondary forest
- Secondary forest - Plantation forest
- Oil palm plantation - Coconut plantation
Although most processing steps were done in Google Earth Engine, which provides cloud computing on high-performance server clusters, the calculations were problematic. Vast amounts of data had to be processed: the area of interest (Jambi) is quite large and covers many image tiles, and for each tile almost two years of data from different sources (Sentinel-1 and Sentinel-2) was processed in many steps. Due to these complex processing chains, calculations were aborted (timed out) from time to time, and others took several days to complete before the results could be evaluated and the following steps initiated.
Ekadinata, Andree, and Grégoire Vincent. 2011. “Rubber agroforests in a changing landscape: Analysis of land use/cover trajectories in Bungo district, Indonesia.” Forests, Trees and Livelihoods 20 (1): 3–14. doi:10.1080/14728028.2011.9756694.
Melati, Dian Nuraini. 2017. “The use of remote sensing data to monitor land use systems and forest variables of the tropical rainforest landscape under transformation in Jambi Province, Sumatra, Indonesia.” PhD thesis, Göttingen.
Nurwanda, Atik, Alinda Fitriany Malik Zain, and Ernan Rustiadi. 2016. “Analysis of Land Cover Changes and Landscape Fragmentation in Batanghari Regency, Jambi Province.” Procedia - Social and Behavioral Sciences 227 (August). Elsevier BV: 87–94. doi:10.1016/j.sbspro.2016.06.047.
Sambodo, Katmoko Ari, and Novie Indriasari. 2018. “Land Cover Classification of ALOS PALSAR Data Using Support Vector Machine.” International Journal of Remote Sensing and Earth Sciences (IJReSES) 10 (1). Indonesian National Institute of Aeronautics and Space (LAPAN). doi:10.30536/j.ijreses.2013.v10.a1836.
Soo Chin Liew, Chew Wai Chang, and Kim Hwa Lim. 2003. “Hyperspectral land cover classification of EO-1 Hyperion data by principal component analysis and pixel unmixing.” In, 3111–13. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/igarss.2002.1027101.
Stolle, F., K. M. Chomitz, E. F. Lambin, and T. P. Tomich. 2003. “Land use and vegetation fires in Jambi Province, Sumatra, Indonesia.” Forest Ecology and Management 179 (1-3): 277–92. doi:10.1016/S0378-1127(02)00547-9.
A work by Jens Wiesehahn